{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "39765d0e",
   "metadata": {},
   "source": [
    "# Ay 119: An Introduction to Deep Learning\n",
    "\n",
    "Matthew J. Graham, 2025\n",
    "\n",
    "The purpose of this notebook is to explore basic concepts in deep learning. I would also recommend an excellent recent review article on the history and use of neural networks in astronomy (https://doi.org/10.1098/rsos.221454). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77aff7bb",
   "metadata": {},
   "source": [
    "## Biological neural nets\n",
    "\n",
    "The best example we have of a highly functional highly efficient neural network is the human brain which can perform the equivalent of an exaflop ($10^{18}$ operations per second) with just 20 Watts of power (although the measured processing speed from mental arithmetic is ~60 bits/sec). The Oak Ridge Frontier supercomputer requires 20 MW of power to perform the same computation. The fastest response from the brain is about 100ms and its storage capacity is estimated as somewhere between 1TB and 2.5PB. All this from something like:\n",
    "\n",
    "<img width=\"600\" src=\"img/Neuron.png\">\n",
    "\n",
    "In particular, the myelinated axon trunk allows action potentials to travel rapidly and efficiently along the axon through the process of saltatory conduction. The myelin sheath insulates the axon, enabling the electrical signal to jump between the Nodes of Ranvier, which speeds up nerve signal transmission and conserves energy. This process is vital for the fast and coordinated functioning of the nervous system, especially in large and complex organisms like humans. A key aspect is that the action potential, first generated at the axon hillock, only progates when the neuron's membrane potential reaches a threshold. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d0d216d",
   "metadata": {},
   "source": [
    "## The Perceptron\n",
    "\n",
    "The first computational model of a biological neuron was the MP neuron in 1943 by Warren McCulloch and Walter Pitts. Their original conception consisted of a set of binary inputs, $x_i$, and a single binary output, $y$: if $\\sum_i x_i  > \\Theta$ then $y = 1$, where $\\Theta$ is some threshold value. In 1958, Rosenblatt extended this model to create the perceptron which weights the inputs, $w_i x_i$ and then passes the sum of the products to an activation function, $f$, which determines the output: \n",
    "\n",
    "prediction or output $p = f({\\mathbf w} \\cdot {\\mathbf x} + b)$\n",
    "\n",
    "$b$ is a bias term that is added to the inputs (and is commonly denoted as $b = x_0 w_0$) to allow the neuron to shift its activation function linearly. \n",
    "\n",
    "<img width=\"750\" src='img/MLNeuron.png'>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90561833",
   "metadata": {},
   "source": [
    "### Activitation functions\n",
    "\n",
    "A number of activation functions beyond the Heaviside or step function have been proposed and/or are in common use:\n",
    "\n",
    "<img src='img/dl4.png'>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1137814",
   "metadata": {},
   "source": [
    "### Loss function\n",
    "\n",
    "For a given problem, we want to compare how well the output of the perceptron $p$ matches our known values $y$ and we do this with a loss function (also known as the objective function, the cost function, or the error function), which can be any differentiable measure. Let's consider the L2 loss so ${\\cal L} (y, p) = \\sum_i(p_i - y_i)^2$ and the change in the loss with a change is the weights is given by $\\frac{\\partial {\\cal L}}{\\partial {\\mathbf w}} = \\frac{\\partial {\\cal L}}{\\partial p} \\frac{\\partial p}{\\partial {\\mathbf w'}}$. There are many other possible loss functions, such as the L1 loss (absolute error). \n",
    "\n",
    "\n",
    "### Training or updating the weights\n",
    "\n",
    "Of course, as it stands, this model is not capable of learning or training, by which we mean here, finding the ideal set of weights to minimize the loss function. One way of doing this is by updating the weights iteratively to progressively reduce the loss function and hopefully converge on an optimal set. We can use $\\frac{\\partial {\\cal L}}{\\partial {\\mathbf w}}$ to navigate down the gradient of the loss function and moderate the amount by which we update the weights at each iteration by a factor known as the learning rate, so the update strategy becomes:\n",
    "\n",
    "${\\mathbf w}_{next} = {\\mathbf w} - \\eta \\frac{\\partial {\\cal L}}{\\partial {\\mathbf w}}$\n",
    "\n",
    "If you apply this to each training example then this is stochastic gradient descent. An alternative is batch gradient descent. Note that gradient descent will not work with a step function since the derivative is a Dirac delta function and the perceptron would never learn. The update rule in this case should be ${\\mathbf w}_{next} = {\\mathbf w} + \\eta {\\mathbf (y - p) x}$ (guaranteed by a convergence theorem proof). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c264ca3",
   "metadata": {},
   "source": [
    "### Exercise 1: learning Boolean logic functions\n",
    "\n",
    "Recall the logic table for a Boolean AND gate/function:\n",
    "\n",
    "|x_1|x_2|p|\n",
    "|---|---|-|\n",
    "|0|0|0|\n",
    "|0|1|0|\n",
    "|1|0|0|\n",
    "|1|1|1|\n",
    "\n",
    "Assume a perceptron with inputs $x_1$ and $x_2$ and weights $w_1$ and $w_2$, and a bias level $x_0 = 1$ with weight $w_0$. Use a step function for the activation function. With an initial random set of weights and a small learning rate ($\\eta = 0.001$, say), run the training for 1000 epochs. \n",
    "\n",
    "Start manually and then move to computation. Plot the loss function against epoch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf61e1b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "*** CODE HERE ***"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03cf5405",
   "metadata": {},
   "source": [
    "Try an OR gate and an XOR gate."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b651d29",
   "metadata": {},
   "source": [
    "## Multilayer Perceptron (MLP)\n",
    "\n",
    "So what went wrong with the XOR function (hint: have a look at the value of the error function as a function of epoch)? The training never converges and actually never will as the function cannot be learnt via a single perceptron. The limitation is that the perceptron is a *linear* classifier and cannot work if the training set is not linearly separable (i.e., two classes cannot be separated by a hyperplane):\n",
    "\n",
    "<table>\n",
    "    <tr><td><img src='img/and.png'></td>\n",
    "        <td><img src='img/xor.png'></td></tr>\n",
    "</table>\n",
    "\n",
    "You can see that there is no straight line possible in the XOR plot that would separate the blue dots from the red ones.\n",
    "\n",
    "What can we do to the perceptron that would allow us to learn the XOR function? We require an intermediate *hidden* layer for a preliminary transformation to achieve the necessary logic:\n",
    "\n",
    "<img width=\"450\" src='img/xor_mlp.png'>\n",
    "\n",
    "It is also normal to use a sigmoid activation function instead of a step activation function: if a MLP has a linear activation function for all neurons then any number of layers can be reduced to a two-layer input-output model but a nonlinear activation function allows a MLP to calculate a nonlinearly separable function. The MLP is also known as a feedforward artifical neural network.\n",
    "\n",
    "\n",
    "### Universal approximation theorem\n",
    "\n",
    "In fact, a perceptron can perfectly emulate a NAND logic gate (you can verify this for yourself) and formally the singleton set {NAND} is functionally complete. Since a set of NAND gates can be combined to calculate any function, it must also be possible to combine a set of neurons (MLP) to calculate any function. More formal proofs exist that show that an infinitely wide (number of nodes) neural network or an infinitely deep (number of layers) neural network is a universal approximator or that \"a neural network with a single hidden layer and non-linear activation functions can represent *any* borel-measurable function\".\n",
    "\n",
    "<img width='750' src='img/img-mamc10.png'>\n",
    "\n",
    "There are a couple of other universal theorems that have become relevant to contemporary machine learning. The Kolmogorov-Arnold Representation Theorem states that a multivariate continuous function (on a bounded domain) can be written as a finite composition of continuous functions of a single variable and the binary operation of addition. There are also equivalent statements for graph neural networks (Brüel-Gabrielsson) and variational quantum circuits (Pérez-Salinas).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc31aabd",
   "metadata": {},
   "source": [
    "### Training the MLP: backpropagation\n",
    "\n",
    "However, just stacking a lot of MLPs together is not enough: we need a way to train such a network. It took about 20 years for the development of the now standard technique: backpropagation or reverse mode of automatic differentiation. Paul Werbos first formulated it for neural networks in 1974, but it was David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986 who popularized the method, leading to its widespread adoption in training neural networks.\n",
    "\n",
    "We've already considered how to train a single neuron (see above) and we can train a MLP with a very similar approach. One thing to be aware of is what the derivative of the activation function might look like: functions that are not continuous will be non-differentiable at particular values of the dependent variable and this could be problematic (unless the behavior can be constrained). For example, the ReLU activation function (shown above) is strictly non-differentiable at zero but the value of the derivative can be defined as 0 or 1 as appropriate. \n",
    "\n",
    "Conceptually, the idea of backpropagation is that we first forward propagate through every layer of the network to get the predicted value and calculate the value of the loss function. We then backpropagate this across every layer in reverse order updating the weights by the appropriate delta (update) as we proceed. So if a network has 5 layers, to calculate the backpropagated error for layer 1, we would first forward propagate from layer 1 => layer 2 => ... => layer 5 => output activation and then backpropagate from output activation => layer 5 => layer 4 => ... => layer 1 and then apply this delta to update the weights of every node in layer 1.\n",
    "\n",
    "<img width=\"500\" src=\"img/mlp.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6073cb8f",
   "metadata": {},
   "source": [
    "In layer $l$, the weighted inputs $z^{(l)}$ are given by: $z^{(l)} = {\\mathbf W^{(l-1)}} {\\mathbf a}^{(l-1)} + {\\mathbf b}^{(l-1)}$ and the activation ${\\mathbf a}^{(l)} = f(z^{(l)})$ with the first set of activations equal to the input values, ${\\mathbf a}^{(1)} = {\\mathbf x}$. The output value $s = {\\mathbf W^{(4)}}{\\mathbf a^{(4)}}$ and the loss function ${\\cal L}(s, y)$.\n",
    "\n",
    "Now for a single weight $w_{jk}^l$, the gradient will be:\n",
    "\n",
    "$ \\frac{\\partial {\\cal L}}{\\partial w_{jk}^l} = \\frac{\\partial {\\cal L}}{\\partial z_j^l} \\frac{\\partial z_j^l}{\\partial w_{jk}^l}$ where $z_j^l = \\sum_{k=1}^m (w_{jk}^l a_k^{l-1} + b_j^l)$ by definition and so $\\frac{\\partial z_j^l}{\\partial w_{jk}^l} = a^{l-1}_k$ and $ \\frac{\\partial {\\cal L}}{\\partial w_{jk}^l}  = \\frac{\\partial {\\cal L}}{\\partial z_j^l} a^{l-1}_k$ .\n",
    "\n",
    "Similar equations can be applied to $b_j^l$. \n",
    "\n",
    "So, for example, weight $w_{22}^{(2)}$ connects $a_{2}^{(2)}$ and $z_{2}^{(3)}$ so $\\frac{\\partial {\\cal L}}{\\partial w_{22}^{(2)}} = \\frac{\\partial {\\cal L}}{\\partial z_2^{(3)}} \\frac{\\partial z_2^{(3)}}{\\partial w_{22}^{(3)}} = \\frac{\\partial {\\cal L}}{\\partial a_2^{(3)}} \\frac{\\partial a_2^{(3)}}{\\partial z_{2}^{(3)}}a_2^{(2)} = \\frac{\\partial {\\cal L}}{\\partial a_2^{(3)}} f'(z_2^{(3)}) a_2^{(2)}$.\n",
    "\n",
    "There are different types of layer available in neural networks, for example, pooling layers and dropout layers, and rules exist for backpropagating the error across these as well."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcdcff2b",
   "metadata": {},
   "source": [
    "## Regularization\n",
    "\n",
    "Regularization is a set of techniques used to prevent overfitting in neural networks by lowering the complexity of a neural network during training.\n",
    "\n",
    "In Lasso and ridge regression, the loss function is modified by the L1 and L2 norm of the weight matrix, respectively. L1 regularization encourages the weight values to be zero where L2 regularization encourages them towards zero (but not exactly zero). Smaller weights reduces the impact of the hidden neurons and as they become neglected, the overall complexity of the neural network gets reduced. \n",
    "\n",
    "<table>\n",
    "<tr><td><img src='img/dl9.png'></td></tr>\n",
    "<tr><td><img src='img/dl10.png'></td></tr>\n",
    "<tr><td><img src='img/dl11.png'></td></tr>\n",
    "<tr><td><img src='img/dl12.png'></td></tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e561c4a9",
   "metadata": {},
   "source": [
    "## Normalization\n",
    "\n",
    "<img src=\"img/dl8.png\">\n",
    "<img src=\"img/dl13.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad9c3177",
   "metadata": {},
   "source": [
    "## Cost/loss functions\n",
    "\n",
    "Different functions are used for different specific tasks:\n",
    "\n",
    "<u>Regression</u>\n",
    "You want to predict a real value quantity: use one output node with a linear activation function and mean squared error (L2) as your loss function\n",
    "\n",
    "<u>Binary classification</u>\n",
    "You want to classify something as belonging to one of two classes: use one output node with a sigmoid activation unit and cross entropy (logarithmic loss) as your loss function\n",
    "\n",
    "<u>Multiclass classification</u>\n",
    "You want to classify something as belonging to one of more than two classes: use one output node per class with a softmax activation function and cross entropy as your loss function\n",
    "\n",
    "Cross entropy is a measure of the difference between probability distributions for a given random variable or set of events: $H(p, q) = -y \\log {\\hat y} - (1-y)\\log(1-{\\hat y})$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fda7fc61",
   "metadata": {},
   "source": [
    "## Optimizers\n",
    "\n",
    "We've already seen (stochastic) gradient descent as a means to optimize and it's a first-order optimization algorthim that is dependent on the first order derivative of a loss function. However, if the learning rate is too small then it make take ages to converge, all parameters have the same learning rate, and it may get trapped in local minima. \n",
    "\n",
    "There are a number of other optimizers but a commonly employed one is Adam (Adaptive Moment Estimation) which calculates an exponential moving average of the gradient and the squared gradient and parameters $\\beta_1$ and $\\beta_2$ to control the decay rates of these moving averages. This has the effect of an optimizer that is very fast and converges rapidly and avoids high variance but it can be computationally costly. Adam is considered state-of-the-art for deep learning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be16f43f",
   "metadata": {},
   "source": [
    "## Different types of neural networks\n",
    "\n",
    "### Convolutional neural network (CNN)\n",
    "\n",
    "<img src=\"img/cnn.png\">\n",
    "\n",
    "### Autoencoder\n",
    "\n",
    "<img width=\"100%\" src=\"img/autoencoder.png\">\n",
    "\n",
    "### Recurrent neural network (RNN)\n",
    "\n",
    "<img src=\"img/rnn.png\">\n",
    "\n",
    "### Long short-term memory (LSTM)\n",
    "\n",
    "<img src=\"img/lstm.png\">\n",
    "\n",
    "### Gated recurrent unit (GRU)\n",
    "\n",
    "<img src=\"img/gru.png\">\n",
    "\n",
    "### Generative adversarial network (GAN)\n",
    "\n",
    "<img src=\"img/GANs.png\">"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}